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Abstract 

We study the problem of learning a latent tree graphical model where samples are avail- 
able only from a subset of variables. We propose two consistent and computationally 
efficient algorithms for learning minimal latent trees, that is, trees without any redundant 
hidden nodes. Unlike many existing methods, the observed nodes (or variables) are not 
constrained to be leaf nodes. Our first algorithm, recursive grouping, builds the latent tree 
recursively by identifying sibling groups using so-called information distances. One of the 
main contributions of this work is our second algorithm, which we refer to as CLGrouping. 
CLGrouping starts with a pre-processing procedure in which a tree over the observed vari- 
ables is constructed. This global step groups the observed nodes that are likely to be close 
to each other in the true latent tree, thereby guiding subsequent recursive grouping (or 
equivalent procedures) on much smaller subsets of variables. This results in more accurate 
and efficient learning of latent trees. We also present regularized versions of our algorithms 
that learn latent tree approximations of arbitrary distributions. We compare the proposed 
algorithms to other methods by performing extensive numerical experiments on various 
latent tree graphical models such as hidden Markov models and star graphs. In addition, 
we demonstrate the applicability of our methods on real-world datasets by modeling the 
dependency structure of monthly stock returns in the S&P index and of the words in the 
20 newsgroups dataset. 

Keywords: Graphical Models, Hidden Variables, Latent Tree Models, Structure Learning 



1. Introduction 

The inclusion of latent variables in modeling complex phenomena and data is a well- 
recognized and a valuable construct in a variety of applications, including bioinformatics 
and computer vision, and the investigation of machine-learning methods for models with 
latent variables is a substantial and continuing direction of research. 

There are three challenging problems in learning a model with latent variables: learning 
the number of latent variables; inferring the structure of how these latent variables relate 
to each other and to the observed variables; and estimating the parameters characterizing 
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those relationships. Issues that one must consider in developing a new learning algorithm 
include developing tractable methods; incorporating the tradeoff between the fidelity to 
the given data and generalizability; deriving theoretical results on the performance of such 
algorithms; and studying applications that provide clear motivation and contexts for the 
models so learned. 

One class of models that has received considerable attention in the literature is the 
class of latent tree models, i.e., graphical models Markov on trees, in which variables at 
some nodes represent the original (observed) variables of interest while others represent 
the latent variables. The appeal of such models for computational tractability is clear: 
with a tree-structured model describing the statistical relationships, inference - processing 
noisy observations of some or all of the original variables to compute the estimates of all 
variables - is straightforward and scalable. Although the class of tree-structured models, 
with or without latent variables, is a constrained one, there are interesting applications 
that provide strong motivation for the work presented here. In particular, a very active 
avenue of research in computer vision is the use of context - e.g., the nature of a scene to 
aid the reliable recognition of objects (and at the same time to allow the recognition of 
particular objects to assist in recognizing the scene). For example, if one knows that an 
image is that of an office, then one might expect to find a desk, a monitor on that desk, and 
perhaps a computer mouse. Hence if one builds a model with a latent variable representing 
that context ("office") and uses simple, noisy detectors for different object types, one would 
expect that the detection of a desk would support the likelihood that one is looking at 
an office and through that enhance the reliability of detecting smaller objects (monitors, 
keyboa rds, mice, etc.). Work along these li nes, i ncluding; by some of the authors of this 
paper (jParikh and Chen! . 120071 : IChoi et all l2O10h . show the promise of using tree-based 



models of context. 

This paper considers the problem of learning tree-structured latent models. If all 
variables are observe d in the tree under consideration, then the well-known algorithm of 
Chow and Liu ( 19681 ) provides a tractable algorithm for performing maximum likelihood 



(ML) esti mation of th e tree structure. However, ML estimation of latent tree models is 
NP-hard (|Rochl . l200fil ). This has motivated a number of investigations of other tractable 



methods for learning such trees as well as theoretical guarantees on performance. Our work 
represents a contribution to this area of investigation. 

There are three main contributions in our paper. Firstly, by adopting a statistical 
distance-based framework, we develop two new algorithms, recursive grouping and CLGroup- 
ing, for the learning of latent trees, which applies equally well to discrete and Gaussian 
models. Secondly, we provide consistency guarantees (both structural and parametric) as 
well as very favorable computational and sample complexity characterizations for both of 
our algorithms. Thirdly, through extensive numerical experiments on both synthetic and 
real- world data, we demonstrate the superiority of our approach for a wide variety of models 
ranging from ones with very large tree diameters (e.g., hidden Markov models (HMMs)) to 
star models and complete trees L 1 ! 

Our first algorithm, which we refer to as recursive grouping, constructs a latent tree 
in a bottom-up fashion, grouping nodes into sibling groups that share the same parent 



1. A complete fc-ary tree (or fc-complete tree) is one in which all leaf nodes are at same depth and all internal 
nodes have degree k. 
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node, recursively at each level of the resulting hierarchy (and allowing for some of the 
observed variables to play roles at arbitrary levels in the resulting hierarchy). Our second 
algorithm, CLGrouping first implements a global construction step, namely producing the 
Chow-Liu tree for the observed variables without any hidden nodes. This global step then 
provides guidance for groups of observed nodes that are likely to be topologically close to 
each ot her in the latent tree, thereby guiding subsequent recursive grouping or neighbor- 
joining (jSaitou and Neil . 119871 ) computations. Each of these algorithms is consistent and 
has excellent sample and computational complexity!! 

As Pearll ( 19881 ) points out, the identification of latent tree models has some built-in 
ambiguity, as there is an entire equivalence class of models in the sense that when all latent 
variables are marginalized out, each model in this class yields the same joint distribution over 
the observed variables. For example, we can take any such latent model and add another 
hidden variable as a leaf node connected to only one other (hidden or obse rved) node. 
Hence , much as one finds in fields such as state space dynamic systems (e.g., iLuenberger 
(|l979l . Section 8)), there is a notion of minimality that is required here, and our results are 
stated in terms of consistent learning of such minimal latent models. 



1.1 Related Work 



The relevant literature on learning latent models is vast and in this section, we summarize 
the main lines of research in this area. 

The classical latent cluster models (LCM) consider multivariate distributions in which 
there exists only one latent variable an d each state of the variable corresponds to a clus- 

Hierarchical latent class (HLC) models 



ter in the data ( 



(jZhang and Kocka 



azars 



2004 



bid and Hcnrv, 1968). 



ZhaJW ET^ul » generalize these models by al- 
lowing multiple latent variables. HLC allows latent variables to have different number 
of states, but assume that all observed nodes are at the leaves of the tree. Their learn- 
ing algorithm is based on a greedy approach of making one local move at a time (e.g., 
introducing one hidden node, or replacing an edge), which is computationally expensive 
and does not have consistency guarantees. Another greedy learning algorithm called BIN 
( Harmeling and Williams . 2010l ) is computationally more efficient, but enforces that each 
internal node is hidden and has three neighboring nodes. In contrast, we fix the number of 
states in each hidden node, but allow observed nodes to be internal nodes. Our algorithms 
are guaranteed to recover the correct structure when certain (mild) conditions are met. 

Many authors als o propose reconstructing latent trees using the expectation maximiza- 
tion (EM) algorithm ( Elidan and Friedman . 2005 : Kemp and Tenenbaum . 20081 ). However, 
as with all other EM-based methods, these approaches depend on the initialization and 
suffer from the possibility of being trapped in local optima and thus no consistency guar- 
antees can be provided. At each iteration, a large number of candidate structures need to 
be evaluated, these methods assume that all observed nodes are the leaves of the tree and 



to reduce the number of candidate structures. Algorithms have been proposed (|Hsu et al 



2. As we will see, depending on the true latent tree model, one or the other of these may be more efficient. 
Roughly speaking, for smaller diameter graphs (such as the star), recursive grouping is faster, and for 
larger diameter graphs (such as an HMM), CLgrouping is more efficient. 
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2009) with sample complexity guarantees for learning HMMs under the condition that the 
joint distribution of the observed variables generated by distinct hidden states are distinct. 

The reconstruction of latent trees has been studied extensively by the phylogenetic com- 
munity where sequences of extant species ar e available an d the u nknown phylogenetic tree 
is to be inferred from these sequences. See burbin et all (|l999h for a t horough overview . 



Efficient algorithms wi th provable performance guarantees are available ( Erdos et al. . 19991 



Daskalakis et al. . 20061 ). However, the works in this area mostly assume that only the leaves 
are observed and each internal node (which is hidden) has the same degree except for the 
root. The most popular algorithm for constructing phylogenetic trees is the neighbor- joining 
(NJ) method bv lSaitou and Neil (|1987l ). Like recursive grouping, the input to the algorithm 
is a set of statistical distances between observed variables. The algorithm proceeds by 
recursively pairing two nodes that are the closest neighbors in the true latent tree and in- 
troducing a hidden node as the p arent of the two nodes. For more details on NJ, the reader 
is referred to Durbin et al. (1 19991 . Section 7.3). 



Another popular class of reconstruction methods used in the phylogenetic community is 
the family of quar tet-based distance methods ( Bandelth and Dressl . 1986 : Erdos et al. . 19991 ; 
Jiang et al.l . bOQlh l^l Quartet-based methods first construct a set of quartets for all subsets 
of four observed nodes. Subsequently, these quartets are then combined to form a latent 
tree. However, when we only have access to the samples at the observed nodes, then it is 
not straightforward to construct a latent tree from a set of quartets since the quartets may 
be not be consistent]^] In fact, it is known that the problem of de t ermin ing a latent tree that 
agrees with the max imum number of quartets is NP-hard ( Steell . Il992l ) but many heuristics 
have been proposed (jFarris . 1972 ; Sattath and Tverskv . 19771 ). Also, when o nly samples are 



avail able, quartet-based methods are usually much less accurate than NJ (|St. John et al 



20031 ) so we only compare our proposed algorithms to NJ. For further comparisons between 
(t he samp l e com plexi ty and other aspects o f) quartet methods and NJ, the reader is referred 



to 



Csurosl (120001 ) and lSt. John et al.1 (120031 ). 



Another distance-based algorithm was proposed in Pearll ( 19881 . Section 8.3.3). This 
algorithm is very similar in spirit to quartet-based methods but instead of finding quartets 
for all subsets of four observed nodes, it finds just enough quartets to determine the location 
of each observed node in the tree. Although the algorit hm is consist ent, it performs poorly 
when only the samples of observed nodes are available ( Pearl . 19881 . Section 8.3.5). 



The learning of p hylogenetic trees is related to the emerging field of network tomography 



(jCastro et all 120041 ) in which one seeks to learn characteristics (such as structure) from 



data which are only available at the end points (e.g., sources and sinks) of the network. 
However, again observations are only available at the leaf nodes and usually t he objective is 



to es t imate the delay distri butions corresponding to nodes linked by an edge (jTsang et al. 



20031 : iBhamidi et ail . l2009h . The modeling of the delay distributions is different from the 
learning of latent tree graphical models discussed in this paper. 



3. A quartet is simply an unrooted binary tree on a set of four observed nodes. 

4. The term consistent here is not the same as the estimation-theoretic one. Here, we say that a set of 
quartets is consistent if there exists a latent tree such that all quartets agree with the tree. 
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1.2 Paper Organization 

The rest of the paper is organized as follows. In Section we introduce the notations 
and terminologies used in the paper. In Section [31 we introduce the notion of information 
distances which are used to reconstruct tree models. In the following two sections, we 
make two assumptions: Firstly, the true distribution is a latent tree and secondly, perfect 
knowledge of information distance of observed variables is available. We introduce recursive 
grouping in Section [U This is followed by our second algorithm CLGrouping in Section [5j 
In Section [6j we relax the assumption that the information distances are known and develop 
sample based algorithms and at the same time provide sample complexity guarantees for 
recursive grouping and CLGrouping. We also discuss extensions of our algorithms for the 
case when the underlying model is not a tree and our goal is to learn an approximation to 
it using a latent tree model. We demonstrate the empirical performance of our algorithms 
in Section [7] and conclude the paper in Section [HJ The Appendix includes proofs for the 
theorems presented in the paper. 



2. Latent Tree Graphical Models 
2.1 Mathematical Notation 

Let G = (W, E) be an undirected graph with vertex (or node) set W = {1, . . . , M} and edge 
set E C ( 2 ) . Let nbd(z; G) and nbd[i; G] be the set of neighbors of node i and the closed 
neighborhood of i respectively, i.e., nbd[z; G] := nbd(i; G) U {i}. For a tree T = (W, E), the 
set of leaf nodes (nodes with degree 1), the maximum degree, and the diameter are denoted 
by Leaf(T), A(T), and diam(T) respectively. The path between two nodes i and j in a tree 
T = (W,E) is the set of edges connecting i and j and is denoted as Path((i, j); E). The 
depth of a node in a tree is the shortest distance (number of hops) to the leaf nodes in T. 
The parent of a node i that has depth s is the neighbor of i with depth s + 1. The set of 
child nodes of a node i with depth s is the set of neighbors of i with depth s — 1, which 
we denote as C(i). A set of nodes that share the same parent is called a sibling group. A 
family is the union of the siblings and the associated parent. 

A latent tree is a tree with node set W := VUH, the union of a set of observed nodes V 
(with m = |V|), and a set of latent (or hidden) nodes H. The effective depth 5(T; V) (with 
respect to V) is the maximum distance of a hidden node to its closest observed node, i.e., 



8{T-V) :=maxmin|Path((i,j);T)|. (1) 

2.2 Graphical Models 

An undirected graphical model ( Lauritzen . 19961 ) is a family of multivariate probability 



distributions that factorize according to a graph G = (W,E). More precisely, let X = 
(Xi, . . . ,Xm) be a random vector, where each random variable Xi, which takes on values 
in an alphabet X , corresponds to variable at node i E V. The set of edges E encodes the set 
of conditional independencies in the model. The random vector X is said to be Markov on 
G if for every i, the random variable Xi is conditionally independent of all other variables 
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given its neighbors, i.e, if p is the joint distribution^ of X, then 

p{xi\x nhA{i . G) ) =p{xi\x\i), (2) 

where Xu denotes the set of all variables^] excluding Xj. Eqn. (|2|) is known as the local 
Markov property. 

In this paper, we consider both discrete and Gaussian graphical models. For discrete 
models, the alphabet X = {1, . . . , K} is a finite set. For Gaussian graphical models, X = K 
and furthermore, without loss of generality, we assume that the mean is known to be the 
zero vector and hence, the joint distribution p(x) oc exp(— ryx T 5] _1 x) depends only on the 
covariance matrix 5]. 

An important and tractable class of graphical models is the set of tree-structured graph- 
ical models, i.e., multivariate probability distributions t hat are Markov on an undirected 
tree T = (W,E). It is known from junction tree theory ( Cowell et al. . 19991 ) that the joint 
distribution p for such a model factorizes as 

p(x 1 ,...,x M )=hp(x 1 ) n ■ w 

That is, the sets of marginal {p(xi) : i G W} and pairwise joints on the edges {p(xi,Xj) : 
G E} fully characterize the joint distribution of a tree-structured graphical model. 
A special class of a discrete tree-structured graphical models is the set of symmetric 
discrete distributions. This class of models is characterized by the fact that the pairs of 
variables (Xi, Xj) on all the edges (i,j) G E follow the conditional probability law: 

p(x 1 \x ] ) = {\- {k - 1)9 ^ :'> (4) 

FK 1 3 ' \ Oij, otherwise, v ; 

and the marginal distribution of every variable in the tree is uniform, i.e., p(xi) = 1/K for 
all Xi G X and for all i G V U H. The parameter Oij G (0, 1/K) in ([4]), which does not 
depend on the states Xj,Xj G X, is known as the crossover probability. 

Let x n := {xW,...^")} be a set of n i.i.d. samples drawn from a graphical model 
(distribution) p, Markov on a latent tree T p = (W,E p ), where W = V U H. Each sample 
x^ G X M is a length-M vector. In our setup, the learner only has access to samples 
drawn from the observed node set V, and we denote this set of sub- vectors containing 
only the elements in V, as Xy := {x^\ . . . ,Xy*'}, where each observed sample Xy G X m 
is a length-m vector. Our algorithms learn latent tree structures using the information 
distances (defined in Section [3|) between pairs of observed variables, which can be estimated 
from samples. 



2.3 Minimal Tree Extensions 

Our ultimate goal is to recover the graphical model p, i.e., the latent tree structure and its 
parameters, given n i.i.d. samples of the observed variables Xy. However, in general, there 

5. We abuse the term distribution to mean a probability mass function in the discrete case (density with 
respect to the counting measure) and a probability density function (density with respect to the Lebesgue 
measure) in the continuous case. 

6. We will use the terms node, vertex and variable interchangeably in the sequel. 
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can be multiple latent tree models which result in the same observed statistics, i.e., the same 
joint distribution py of the observed variables. We consider the class of tree models where 
it is possible to recover the latent tree model uniquely and provide necessary conditions for 
structure identifiability, i.e., the identifiability of the edge set E. 

Firstly, we limit ourselves to the scenario where all the random variables (both observed 
and latent) take values on a common alphabet X . Thus, in the Gaussian case, each hidden 
and observed variable is a univariate Gaussian. In the discrete case, each variable takes on 
values in the same finite alphabet X . Note that the model may not be identifiable if some 
of the hidden variables are allowed to have arbitrary alphabets. As an example, consider 
a discrete latent tree model with binary observed variables {K = 2). A latent tree with 
the simplest structure (fewest number of nodes) is a tree in which all m observed binary 
variables are connected to one hidden variable. If we allow the hidden variable to take on 
2 m states, then the tree can describe all possible statistics among the m observed variables, 
i.e., the joint distribution py can be arbitrary^ 

A probability distribution py(xy) is said to be tree- decomposable if it is the marginal 
(of variables in V) of a tree-structured graphical m odel p(xy,X ff). In this case, p (over 
variables in W) is said to be a tree extension of py ( Pearl 19881 ). A distribution p is said 



to have a redundant hidden node h G H if we can remove h and the marginal on the set of 
visible nodes V remains as py. T he following conditions ensure that a latent tree does not 
include a redundant hidden node ( Pearll . 19881 ) : 



(CI) Each hidden variable has at least three neighbors (which can be either hidden or 
observed). Note that this ensures that all leaf nodes are observed (although not all 
observed nodes need to be leaves). 

(C2) Any two variables connected by an edge in the tree model are neither perfectly de- 
pendent nor independent. 

Figure [TJa) shows an example of a tree satisfying (CI). If (C2), which is a condition on 
parameters, is also satisfied, then the tree in Figure [T^a) is identifiable. The tree shown in 
Figure QJb) does not satisfy (CI) because and h§ have degrees less than 3. In fact, if we 
marginalize out the hidden variables h.4 and /15, then the resulting model has the same tree 
structure as in Figure QJa). 

We assume throughout the paper that (C2) is satisfied for all probability distributions. 
Let 7>3 be the set of (latent) trees satisfying (CI). We refer to 7>3 as the set of minimal 
(or identifiable) latent trees. Minimal latent trees do not contain redundant hidden nodes. 
The distribution p (over W and Markov on some tree in 7>3) is said to be a minimal 
tree extension of py. As illustrated in Figure [H using marginalization operations, any 
non-minimal latent tree distribution can be reduced to a minimal latent tree model. 



Proposition 1 (Minimal Tree Extensions) (IPearll . Il988l . Section 8.3) 



(i) For every tree-decomposable distribution py, there exists a minimal tree extension p 
Markov on a tree T G 7>3, which is unique up to the renaming of the variables or 
their values. 



7. This follows from a elementary parameter counting argument. 
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Figure 1: Examples of minimal latent trees. Shaded nodes are observed and unshaded 
nodes are hidden, (a) An identifiable tree, (b) A non-identifiable tree because /14 
and /15 have degrees less than 3. 



(ii) For Gaussian and binary distributions, if pv is known exactly, then the minimal tree 
extension p can be recovered. 

(Hi) The structure of T is uniquely determined by the pairwise distributions of observed 
variables p(xi,Xj) for all i,j G V. 

2.4 Consistency 

We now define the notion of consistency. In Section [H we show that our latent tree learning 
algorithms are consistent. 

Definition 2 (Consistency) A latent tree reconstruction algorithm A is a map from the 
observed samples to an estimated tree T n and an estimated tree- structured graphical 
model pp. We say that a latent tree reconstruction algorithm A is structurally consistent if 
there exists a graph homomorphism^ h such that 

lim Pr(/i(f n ) ^ T p ) = 0. (5) 

n— >oo 

Furthermore, we say that A is risk consistent if to every e > 0, 

lim Pr(D(p||p n ) > e) = 0, (6) 



where D(p\\p n ) is the KL-divergence (Cover and Thomat . 200 &) between the true distrib 



tion p and the estimated distribution p 12 . 

In the following sections, we design structurally and risk consistent algorithms for (min- 
imal) Gaussian and symmetric discrete latent tree models, defined in @. Our algorithms 
use pairwise distributions between the observed nodes. However, for general discrete mod- 
els, pairwise distributions between observed nodes are, in general, not sufficient to recover 



A graph homomorphism is a mapping between graphs that respects their structure. More precisely, a 
graph homomorphism h from a graph G = (W,E) to a graph G' = (V',E'), written h : G — > G' is a 
mapping h : V — > V' such that £ E implies that (h(i), h(j)) £ E'. 
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the parameters (jChang and Hartiganl . Il99ll ). Therefore, we only prove structural consis- 
tency, as defined in ©, for general discrete latent tree models. For such distributions, we 
consider a two-step procedure for structure and parameter estimation: Firstly, we estimate 
the structure of the latent tree using the algorithms sug gested in this paper. S ubsequently, 



we use the Expectation Maximization (EM) algorithm ([Dempster et all 119771 ) to infer the 



parameters. Note that, as mentioned, risk consistency will not be guaranteed in this case. 
3. Information Distances 

The proposed algorithms in this paper receive as inputs the set of so-called (exact or es- 
timated) information distances, which are functions of the pairwise distributions. These 
quantities are defined in Section 13.11 for the two classes of tree-structured graphical models 
discussed in this paper, namely the Gaussian and discrete graphical models. We also show 
that the information distances have a particularly simple form for symmetric discrete distri- 
butions. In Section [3.21 we use the information distances to infer the relationships between 
the observed variables such as j is a child of i or i and j are siblings. 

3.1 Definitions of Information Distances 

We define information distances for Gaussian and discrete distributions and show that these 
distances are additive for tree-structured graphical models. Recall that for two random 
variables Xi and Xj , the correlation coefficient is defined as 

= CovpQ,^) 
Pl3 ' VVarpQVarpr,-)' 

For Gaussian graphical models, the information distance associated with the pair of variables 
Xi and Xj is defined as: 

d^ := -\og\pij\. (8) 

Intuitively, if the information distance dij is large, then Xi and Xj are weakly correlated 
and vice-versa. 

For discrete graphical models, let J lJ denote the joint probability matrix between Xi and 
Xj (i.e., J l J b = p(xi = a, Xj = b),a,b G X). Also let rVP be the diagonal marginal probability 
matrix of (i.e., M % aa = p{xi = a)). For discrete graphical mo dels, the inf ormation distance 
associated with the pair of variables Xi and Xj is defined as (jLakel . ll994T ): 



|detJ^| 

da := — log ; . =. (9) 
VdetlVP detJVP 

Note that for binary variables, i.e., K = 2, the value of d^ in @ reduces to the expression 
in (|8|), i.e., the information distance is a function of the correlation coefficient, defined in ([7]), 
just as in the Gaussian case. 

For symmetric discrete distributions defined in @, the information distance defined for 
discrete graphical models in ([9]) reduces to 

dij:=-(K-l)log(l-K9ij). (10) 
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Note that there is one-to-one correspondence between the information distances d^ and 
model parameters for Gaussian distributions (parametrized by the correlation coefficient pij) 
in ([8]) and the symmetric discrete distributions (parametrized by the crossover probability 
dij) in (|10P - This is, however, not true for general discrete distributions. 

Equipped with these definitions of information distances, assumption (C2) in Section [2T31 
can be rewritten as the following: There exists constants < l,u < oo, such that 

(C2) l<dij<u, V(i,j)€Ep. (11) 

Proposition 3 (Additivity of Information Dista nces) The informa tion distances dij 
defined in |P|), and OTTjJ are additive tree metrics t 'Erdos et ai , 199& ). In other words, 



if the joint probability distribution p(x) is a tree- structured graphical model Markov on the 
tree T p = (W,E p ), then the information distances are additive on T p : 

d kl = Vfc,i€W. (12) 

(i,j)ePath((fc,0;£ P ) 

The property in (|12h means that if each pair of vertices i,j €W is assigned the weight 
dij, then T p is a minimum spanning tree on W, denoted as MST(W / ;D), where D is the 
information distance matrix with elements dij for all i,j G V. 

It is straightforward to show that the information distances are additive for the Gaussian 
and symmetric discrete cases using the local Markov property of gr aphica l mod els. For 
general discrete distributions with information distance as in Q, see lLakel (Il994h for the 



proof. In the rest of the paper, we map the parameters of Gaussian and discrete distributions 
to an information distance matrix D = [dij] to unify the analyses for both cases. 

3.2 Testing Inter-Node Relationships 

In this section, we use Proposition [3] to ascertain child-parent and sibling (cf. Section 12. ip 



relationships between the variables in a latent tree-structured graphical model. To do so, 
for any three variables i,j, k G V, we define <&ijk ■= dn~ — djk to be the difference between 
the information distances d^ and djk- The following lemma suggests a simple procedure to 
identify the set of relationships between the nodes. 

Lemma 4 (Sibling Grouping) For distances dij for all i,j€.V on a tree T G 7>3, the 
following two properties on §ijk = dik — djk hold: 

(i) &ijk = d^ for all k G V \ {i,j} if and only if i is a leaf node and j is its parent. 

(ii) —dij < &ijk = &ijk' < dij for all k,k' G V \ if and only if both i and j are leaf 

nodes and they have the same parent, i.e., they belong to the same sibling group. 

The proof of the lemma uses Proposition [3] and is provided in Appendix IA.ll Given 
Lemma HI we can first determine all the values of §ijk for triples i, j, k G V. Now we can 
determine the relationship between nodes i and j as follows: Fix the pair of nodes i,j £ V 
and consider all the other nodes k G V \ {i,j}- Then, there are three possibilities for the 
set {<S>i jk : k G V\{i,j}}: 
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Figure 2: Examples for each case in TestNodeRelationships. For each edge, ei represents the 
information distance associated with the edge, (a) Case 1: §ij k = —&8 = ~~ dij f° r 
all k G V \ {i,j}- (b) Case 2: &ij k = eQ — ej ^ dij = e$ + e-j for all k G V\ {i, j} 
(c) Case 3a: <&ij k = e 4 + e2 + e3 — ej ^ $ijk' = e 4 — e2 — ez — ej. (d) Case 3b: 
$ijk = e 4 + e 5 / $ij k > = e 4 - e 5 . (e) Case 3c: $ ijk = e 5 / $ ijk > = -e 5 . 



1. = djj for all G V \ {«, j}. Then, i is a leaf node and j is a parent of i. Similarly, 
if &ijk = — d^ for all k G y \ {i, j}, j is a leaf node and i is a parent of j. 

2. &ij k is constant for all k G F \ {«, j} but noi equal to either dij or — d^. Then i and 
j are leaf nodes and they are siblings. 

3. &ijk is not equal for all k G V \ {i,j}- Then, there are three possibilities: Either 

(a) Nodes % and j are not siblings nor have a parent-child relationship or, 

(b) Nodes i and j are siblings but at least one of them is not a leaf or, 

(c) Nodes % and j have a parent-child relationship but the child is not a leaf. 

Thus, we have a simple test to determine the relationship between i and j and to ascertain 
whether i and j are leaf nodes. We call the above test TestNodeRelationships. See Figure [2] 
for examples. By running this test for all i and j, we can determine all the relationships 
among all pairs of observed variables. 

In the following section, we describe a recursive algorithm that is based on the above 
TestNodeRelationships procedure to reconstruct the entire latent tree model assuming that 
the true distance matrix D = [dij] are known. In Section[5l we provide improved algorithms 
for the learning of latent trees again assuming that D is known. Subsequently, in Section [6l 
we develop algorithms for the consistent reconstruction of latent trees when information 
distances are unknown and we have to estimate them from the samples Xy. In addition, in 
Section f6. 41 we discuss how to extend these algorithms for the case when py is not necessarily 
tree-decomposable, i.e., the original graphical model is not assumed to be a latent tree. 

4. Recursive Grouping Algorithm Given Information Distances 

This section is devoted to the development of the first algorithm for reconstructing latent 
tree models, recursive grouping (RG). At a high level, RG is a recursive procedure in which 
at each step, TestNodeRelationships is used to identify nodes that belong to the same family. 
Subsequently, RG introduces a parent node if a family of nodes (i.e., a sibling group) 
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does not contain an observed parent. This newly introduced parent node corresponds to a 
hidden node in the original unknown latent tree. Once such a parent (i.e., hidden) node h is 
introduced, the information distances from h to all other observed nodes can be computed. 

The inputs to RG are the vertex set V and the matrix of information distances D 
corresponding to a latent tree. The algorithm proceeds by recursively grouping nodes and 
adding hidden variables. In each iteration, the algorithm acts on a so-called active set of 
nodes Y, and in the process constructs a new active set Y ncw for the next iteration^ The 
steps are as follows: 

1. Initialize by setting Y := V to be the set of observed variables. 

2. Compute $>ijk = dik — djk for all i, j, k G Y. 

3. Using the TestNodeRelationships procedure, define {Ib}^ to be the coarsest parti- 
tion 10 ! of Y such that for every subset II; (with |ILJ > 2), any two nodes in 11/ are 
either siblings which are leaf nodes or they have a parent-child relationship in which 
the child is a leaf. Note that for some I, Hi may consist of a single node. Begin to 
construct the new active set by setting Y ncw ^— Uz-|nj|=i 

4. For each I = 1, . . . , L with |Ib| > 2, if II; contains a parent node u, update Y new 
Y new U {u}. Otherwise, introduce a new hidden node h, connect h (as a parent) to 
every node in lb, and set Y new Y new U {h}. 

5. Update the active set: Y Q id ^— Y and Y <— Y new . 

6. For each new hidden node h G Y, compute the information distances dhi for all I G Y 
using (|13p and (|14p described below. 

7. If \Y\ > 3, return to step 2. Otherwise, if \Y\ = 2, connect the two remaining nodes 
in Y with an edge then stop. If instead \Y\ = 1, do nothing and stop. 

We now describe how to compute the information distances in Step [6] for each new 
hidden node h G Y and all other active nodes I G Y. Let i,j G C{h) be two children of h, 
and let k G Y^d \ {i,j} be any other node in the previous active set. From Proposition [3l 
we have that dih — djh = dik — djk = &ijk and dih + djh = dij , from which we can recover the 
information distances between a previously active node i G Y^d and its new hidden parent 
h G Y as follows: 



For any other active node I G Y, we can compute dhi using a child node i G C{h) as follows: 



9. Note that the current active set is also used (in Step [6]) after the new active set has been defined. For 
clarity, we also introduce the quantity Y \d in Steps [5] and [6] 
10. Recall that a partition P of a set Y is a collection of nonempty subsets {Hi C Y}^ =1 such that U^il; = Y 
and II; n IT;/ = for all I ^ I' . A partition P is said to be coarser than another partition P' if every 
element of P' is a subset of some element of P. 



dih = g + $ijk) ■ 



(13) 




da — dih, if / G Y id, 

dik — dih — dik, otherwise, where k G C(l). 



(14) 
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Figure 3: An illustrative example of RG. Solid nodes indicate the active set Y for each 
iteration, (a) Original latent tree, (b) Output after the first iteration of RG. Red 
dotted lines indicate the subsets IT; in the partition of Y. (c) Output after the 
second iteration of RG. Note that h%, which was introduced in the first iteration, 
is an active node for the second iteration. Nodes 4,5, and 6 do not belong to the 
current active set and are represented in grey, (d) Output after the third iteration 
of RG, which is same as the original latent tree. 



Using equations (|13|) and (|14|) . we can infer all the information distances dhi between a 
newly introduced hidden node h to all other active nodes I G Y. Consequently, we have all 
the distances between all pairs of nodes in the active set Y . It can be shown that this 
algorithm recovers all minimal latent trees. The proof of the following theorem is provided 
in Appendix IA. 21 

Theorem 5 (Correctness and Computational Complexity of RG) IfT p G 7>3 and 
the matrix of information distances D (between nodes in V) is available, then RG outputs 
the true latent tree T p correctly in time 0(diam(T p )m 3 ). 

We now use a concrete example to illustrate the steps involved in RG. In Figure EJa), 
the original unknown latent tree is shown. In this tree, nodes 1,...,6 are the observed 
nodes and h±,h2,hs are the hidden nodes. We start by considering the set of observed 
nodes as active nodes Y := V = {1,...,6}. Once &ijk are computed from the given 
distances d+j, TestNodeRelationships is used to determine that Y is partitioned into four 
subsets: IF = {1}, II2 = {2,4},U3 = {5,6},U4 = {3}. The subsets Tl\ and U4 contain 
only one node. The subset II3 contains two siblings that are leaf nodes. The subset II2 
contains a parent node 2 and a child node 4, which is a leaf node. Since II3 does not 
contain a parent, we introduce a new hidden node h\ and connect h\ to 5 and 6 as shown 
in Figure 13(b). The information distances d^ and d^ can be computed using (fl~3|) . e.g., 
d*>hi = ^(^56 + ^56i)- The new active set is the union of all nodes in the single- node subsets, 
a parent node, and a new hidden node Y new = {1,2,3, h{\. Distances among the pairs of 
nodes in Y ncw can be computed using ([Til) (e.g., d\^ = di$ —d^)- In the second iteration, 
we again use TestNodeRelationships to ascertain that Y can be partitioned into IIi = {1, 2} 
and II2 = {h\, 3}. These two subsets do not have parents so hi and /13 are added to IIi 
and II2 respectively. Parent nodes hi and h% are connected to their children in IIi and 
II2 as shown in Figure [3|c). Finally, we are left with the active set as Y = {hi,h%} and 
the algorithm terminates after hi and ^3 are connected by an edge. The hitherto unknown 
latent tree is fully reconstructed as shown in Figure [3^d) . 
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A potential drawback of RG is that it involves multiple local operations, which may result 
in a high computational complexity. Indeed, from Theorem [5j the worst-case complexity is 
0(m 4 ) which occurs when T p , the true latent tree, is a hidden Markov model (HMM). This 
may be computationally prohibitive if m is large. In Section [5] we design an algorithm which 
uses a global pre-processing step to reduce the overall complexity substantially, especially 
for trees with large diameters (of which HMMs are extreme examples). 

5. CLGrouping Algorithm Given Information Distances 

In this section, we present CLGrouping, an algorithm for reconstructing latent trees more 
efficiently than RG. As in Section HI in this section, we assume that D is known; the 
extension to unknown D is discussed in Section 16.31 CLGrouping is a two-step procedure, 
the first of whi ch is a global pre-proc essing step that involves the construction of a so-called 
Chow-Liu tree ( Chow and Liu . 19681 ) over the set of observed nodes V. This step identifies 



nodes that do not belong to the same sibling group. In the second step, we complete the 
recovery of the latent tree by applying a distance-based latent tree reconstruction algorithm 
(such as RG or NJ) repeatedly on smaller subsets of nodes. We review the Chow-Liu 
algorithm in Section[5J] relate the Chow-Liu tree to the true latent tree in Section [531 derive 
a simple transformation of the Chow-Liu tree to obtain the latent tree in Section 15.31 and 
propose CLGrouping in Section 15.41 For simplicity, we focus on the Gaussian distributions 
and the symmetric discrete distributions first, and discuss the extension to general discrete 
models in Section 15.51 

5.1 A Review of the Chow-Liu Algorithm 

In this section, we review the Chow-Liu tree reconstruction procedure. To do so, define 
T(V) to be the set of trees with vertex set V and V(T(V)) to be the set of tree-structured 
graphical models whose graph has vertex set V, i.e., every q g V(T(V)) factoriz es as in (|3J). 

Given an arbitrary multivariate distribution py(x.y), Chow and Liu ( 19681 ) considered 
the following KL-divergence minimization problem: 

PCL ■= argmin D(p v \\q). (15) 

qeV(T(V)) 

That is, among all the tree-structured graphical models with vertex set V, the distribution 
Pcl is the closest one to py in terms of the KL-divergence. By using the factorization 
property in ([3]), we can easily verify that pcl is Markov on the Chow-Liu tree Xcl = (V, Eql) 
which is given by the optimization problem!^! 

T CL = argmax V L(X t ; X 3 ). (16) 



In (fT6lh L(Xi ; Xj) = D(p(xi,Xj) \ \p(xi)p(xj)) is the mutual information (jCover and Thomas 
120061 ) between ran dom variables Xj and Xj . The optimization in (116p is a max- weight span- 
ning tree problem ( Cormen et al. . 20031 ) which can be solved efficiently in time 0(m 2 logm) 



11. In (|16[1 and the rest of the paper, we adopt the following simplifying notation; If T = (V,E) and if 
& E, we will also say that 6 T. 
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using either Kruskal's algorithm ( Kruskal . 19561 ) or Prim's algorithm (Prim, 1957 ). The edge 
weights for the max-weight spanning tree are precisely the mutual information quantities 
between random variables. Note that the parameters of p in (I15p are found by setting the 
pairwise distributions pch(xi,Xj) on the edges to py(xi,Xj), i.e., pch(xi,Xj) = pyixi 



, X-j 



for all £ Eql- We now relate the Chow-Liu tree on the observed nodes and the 

information distance matrix D. 



Lemma 6 (Correspondence between Tql and MST) If pv is a Gaussian distribution 
or a symmetric discrete distribution, then the Chow-Liu tree in f)16|) reduces to the minimum 
spanning tree (MST) where the edge weights are the information distances dij, i.e., 

Tql = MST(V; D) := argmin V dy. (17) 

re7 ™ (y)er 



Lemma [6l whose proof is omitted, follows because for Gaussian and symmetric discrete 
models, the mutual informational I(X{ ; Xj) is a monotonically decreasing function of the 
information distance dy@ For other graphical models (e.g., non-symmetric discrete dis- 
tributions), this relationship is not necessarily true. See Section [5.51 for a discussion. Note 
that when all nodes are observed (i.e., W = V), Lemma [6] reduces to Proposition [3l 

5.2 Relationship between the Latent Tree and the Chow-Liu Tree (MST) 

In this section, we relate MST(y; D) in (|17p to the original latent tree T p . To relate the 
two trees, MST(y; D) and T p , we first introduce the notion of a surrogate node. 

Definition 7 ^Surrogate Node,) Given the latent tree T p = (W,E p ) and any node i £ W, 
the surrogate node of i with respect to V is defined as 

Sg(i;T p ,V) := argmin d^. (18) 
jeV 



Intuitively, the surrogate node of a hidden node h 6 H is an observed node j S V that 
is most strongly correlated to h. In other words, the information distance between h and j 
is the smallest. Note that if % & V, then Sg(i; T p , V) = i since da = 0. The map Sg(i; T p , V) 
is a many-to-one function, i.e., several nodes may have the same surrogate node, and its 
inverse is the inverse surrogate set of i denoted as 

Sg-^Tp.F) := {h e W : Sg(h;T p ,V) = i}. (19) 

When the tree T p and the observed vertex set V are understood from context, the surrogate 
node of h and the inverse surrogate set of i are abbreviated as Sg(/i) and Sg~ 1 (i) respectively. 
We now relate the original latent tree T p = (W, E p ) to the Chow-Liu tree (also termed the 
MST) MST(F; D) formed using the distance matrix D. 

12. Note that, unlike information distances dij, the mutual information quantities I(Xi ; Xj) do not form 
an additive metric on T p . 

13. For example, in the case of Gaussians, I(X, ; Xj) = -|log(l - p%) \ Cover and Thomas! , lioOrjh . 
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Lemma 8 (Properties of the MST ) The MST in (fT7|) and surrogate nodes satisfy the 
following properties: 

(i) The surrogate nodes of any two neighboring nodes in E p are neighbors in the MST, 
i.e., for all i,j G W with Sg(z) 7^ Sg(j), 

(i,j) £E p ^ (Sg(i),Sg(j)) G MST(F;D). (20) 

(ii) If j G V and h G Sg _1 (j) ; then every node along the path connecting j and h belongs 
to the inverse surrogate set Sg" 1 ^'). 

(Hi) The maximum degree of the MST satisfies 

A(MST(T/; D)) < A(T p ) 1+ f s <~ t p > v \ (21) 

where 5(T p ;V) is the effective depth defined in (pQ) and l,u are the bounds on the 
information distances on edges in T p defined in (llip . 



The proof of this result can be found in Appendix IA.3I As a result of Lemma EJ the 
properties of MST(V;D) can be expressed in terms of the original latent tree T p . For ex- 
ample, in Figure [5{a), a latent tree is shown with its corresponding surrogacy relationships, 
and Figure EJ^b) shows the corresponding MST over the observed nodes. 

The properties in Lemm a ETi-ii) can also be regarded as edge- contraction operations 
( Robinson and Foulds . 198ll ) in the original latent tree to obtain the MST. More precisely, 



an edge-contraction operation on an edge (j, h) G V x H in the latent tree T p is defined 
as the "shrinking" of (j, h) to a single node whose label is the observed node j. Thus, the 
edge (j,h) is "contracted" to a single node j. By using Lemma [S^i-ii), we observe that the 
Chow-Liu tree MST(F; D) is formed by applying edge-contraction operations sequentially 
to each (j, h) pair for all h G Sg -1 (j) n H until all pairs have been contracted to a single 
node j. For example, the MST in Figure Etb) is obtained by contracting edges (3,^3), 
(5, /12), and then (5, hi) in the latent tree in Figure [5](a) . 

The properties in Lemma [8] can be used to design efficient algorithms based on trans- 
forming the MST to obtain the latent tree T p . Note that the maximum degree of the 
MST, A(MST(V;D)), is bounded by the maximum degree in the original latent tree. The 
quantity A(MST(V;D)) determines the computational complexity of one of our proposed 
algorithms (CLGrouping) and it is small if the depth of the latent tree 5(T p ; V) is small and 
the information distances satisfy tight bounds (i.e., u/l is close to unity). The latter 
condition holds for (almost) homogeneous models in which all the information distances dij 
on the edges are almost equal. 



5.3 Chow-Liu Blind Algorithm for a Subclass of Latent Trees 

In this section, we present a simple and intuitive transformation of the Chow-Liu tree 
that produces the original latent tree. However, this algorithm, called Chow-Liu Blind (or 
CLBlind), is applicable only to a subset of latent trees called blind latent tree- structured 
graphical models "P(7biind)- Equipped with the intuition from CLBlind, we generalize it 
in Section 15.41 to design the CLGrouping algorithm that produces the correct latent tree 
structure from the MST for all minimal latent tree models. If p G 'P(Tbimd)) then its 
structure T p = (W, E p ) and the distance matrix D satisfy the following properties: 
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Figure 4: An illustration of CLBlind. The shaded nodes are the observed nodes and the 
rest are hidden nodes. The dotted lines denote surrogate mappings for the hidden 
nodes, (a) Original latent tree, which belongs to the class of blind latent graphical 
models, (b) Chow-Liu tree over the observed nodes, (c) Node 3 is the input to the 
blind transformation, (d) Output after the blind transformation, (e) Node 2 is 
the input to the blind transformation, (f) Output after the blind transformation, 
which is same as the original latent tree. 



(i) The true latent tree T p G 7>3 and all the internal nodea^l are hidden, i.e., V = 
Leaf (T p ). 

(ii) The surrogate node of (i.e., the observed node with the strongest correlation with) 
each hidden node is one of its children, i.e., Sg(/i) € C(h) for all h £ H. 

We now describe the CLBlind algorithm, which involves two main steps. Firstly, 
MST(V; D) is constructed using the distance matrix D. Secondly, we apply the blind 
transformation of the Chow-Liu tree BlindTransform(MST(V; D)), which proceeds as fol- 
lows: 

1. Identify the set of internal nodes in MST(V;D). We perform an operation for each 
internal node as follows: 

2. For internal node i, add a hidden node h to the tree. 

3. Connect an edge between h and i (which now becomes a leaf node) and also connect 
edges between h and the neighbors of i in the current tree model. 

4. Repeat steps [2] and [3] until all internal nodes have been operated on. 

See Figure H] for an illustration of CLBlind. We use the adjective blind to describe the trans- 
formation BlindTransform(MST(y; D)) since it does not depend on the distance matrix D 
but uses only the structure of the MST. The following theorem whose proof can be found 
in Appendix I A. 41 states the correctness result for CLBlind. 

Theorem 9 ^Correctness and Computational Complexity of CLBlind,) If p £ 

'P(Tbimd) is a blind tree-structured graphical model Markov on T p and the matrix of distances 
D is known, then CLBlind outputs the true latent tree T p correctly in time 0(m?logm). 

The first condition on 'P(Tbiind) that all internal nodes are hidden is not uncommon in ap- 
plications. For example, in phylogenetics, (DNA or amino acid) sequences of extant species 

14. Recall that an internal node is one whose degree is greater than or equal to 2, i.e., a non-leaf. 
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at the leaves are observed, while the sequences of the extinct species are hidden (correspond- 
ing to the internal nodes), and the evolutionary (phylogenetic) tree is to be reconstructed. 
However, the second condition is more restrictive 15 ! since it implies that each hidden node 
is connected to at least one observed node and that it is closer (i.e., more correlated) to 
one of its observed children compared to any other observed node. If the first constraint 
is satisfied but not the second, then the blind transformation BlindTransform(MST(y; D)) 
does not overestimate the number of hidden variables in the latent tree (the proof follows 
from Lemma [8] and is omitted). 

Since the computational complexity of constructing the MST is 0(m 2 log m) where 
m = \V\, and the blind transformation is at most linear in m, the overall computational 
complexity is 0(m 2 log m). Thus, CLBlind is a computationally efficient procedure com- 
pared to RG, described in Section |H 

5.4 Chow-Liu Grouping Algorithm 

Even though CLBlind is computationally efficient, it only succeeds in recovering latent trees 
for a restricted subclass of minimal latent trees. In this section, we propose an efficient 
algorithm, called CLGrouping that reconstructs all minimal latent trees. We also illustrate 
CLGrouping using an example. CLGrouping uses the properties of the MST as described 
in Lemma [8l 

At a high-level, CLGrouping involves two distinct steps: Firstly, we construct the Chow- 
Liu tree MST(y; D) over the set of observed nodes V. Secondly, we apply RG or NJ 
to reconstruct a latent subtree over the closed neighborhoods of every internal node in 
MST(F;D). If RG (respectively NJ) is used, we term the algorithm CLRG (respectively 
CLNJ). In the rest of the section, we only describe CLRG for concreteness since CLNJ 
proceeds along similar lines. Formally, CLRG proceeds as follows: 

1. Construct the Chow-Liu tree MST(V;D) as in {HJ). Set T = MST(V; D). 

2. Identify the set of internal nodes in MST(F; D). 

3. For each internal node i, let nbd[i; T] be its closed neighborhood in T and let S = 
RG(nbd[z; T], D) be the output of RG with nbd[i; T] as the set of input nodes. 

4. Replace the subtree over node set nbd[i; T] in T with S. Denote the new tree as T. 

5. Repeat steps [3] and H] until all internal nodes have been operated on. 

Note that the only difference between the algorithm we just described and CLNJ is Step [3] 
in which the subroutine NJ replaces RG. Also, observe in Step [3] that RG is only applied 
to a small subset of nodes which have been identified in Step Q] as possible neighbors in 
the true latent tree. This reduces the computational complexity of CLRG compared to 
RG as seen in the following theorem whose proof is provided in Appendix IA.5I Let \J\ := 
\V \ Leaf(MST(F; D))| < to be the number of internal nodes in the MST. 

Theorem 10 ('Correctness and Computational Complexity of CLRG,) If'T p £ 7>3 
is a minimal latent tree and the matrix of information distances D is available, then CLRG 
outputs the true latent tree T p correctly in time 0(m 2 log m + | J|A 3 (MST(F; D))). 

15. The second condition on T^Tblted) holds when the tree is (almost) homogeneous. 
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Figure 5: Illustration of CLRG. The shaded nodes are the observed nodes and the rest are 
hidden nodes. The dotted lines denote surrogate mappings for the hidden nodes 
so for example, node 3 is the surrogate of /13. (a) The original latent tree, (b) The 
Chow-Liu tree (MST) over the observed nodes V, (c) The closed neighborhood 
of node 5 is the input to RG, (d) Output after the first RG procedure, (e) The 
closed neighborhood of node 3 is the input to the second iteration of RG, (f) 
Output after the second RG procedure, which is same as the original latent tree. 



Thus, the computational complexity of CLRG is low when the latent tree T p has a small 
maximum degree and a small effective depth (such as the HMM) because (|21h implies that 
A(MST(U; D)) is also small. Indeed, we demonstrate in Section [7] that there is a significant 
speedup compared to applying RG over the entire observed node set V. 

We now illustrate CLRG using the example shown in Figure The original minimal 
latent tree T p = (W,E) is shown in Figure 0a) with W = {1, 2, . . . , 6, hi, hi, h^}. The 
set of observed nodes is V = {1, . . . , 6} and the set of hidden nodes is H = {hi, /12, ^3}- 
The Chow-Liu tree Tcl = MST(V; D) formed using the information distance matrix D is 
shown in Figure 0(b). Since nodes 3 and 5 are the only internal nodes in MST(V";D), two 
RG operations will be executed on the closed neighborhoods of each of these two nodes. In 
the first iteration, the closed neighborhood of node 5 is the input to RG. This is shown in 
Figure 0c) where nbd[4; MST(V; D)] = {1,3,4,5}, which is then replaced by the output 
of RG to obtain the tree shown in Figure 0d). In the next iteration, RG is applied to 
the closed neighborhood of node 3 in the current tree nbd[3;T] = {2, 3, 6, hi} as shown 
in Figure 0e). Note that nbd[3;T] includes h\ £ H, which was introduced by RG in the 
previous iteration. This closed neighborhood is then replaced by the output of the second 
RG operation and the original latent tree T p is obtained as shown in Figure 0f). 

Observe that the trees obtained at each iterat ion of CLRG are related to the original 
latent tree in terms of edge-contraction operations ( Robinson and Foulds . 198ll ). which were 
defined in Section [5.21 For example, the Chow-Liu tree in Figure 0b) is obtained from the 
latent tree T p in Figure 0a) by sequentially contracting all edges connecting an observed 
node to its inverse surrogate set (cf. Lemma 0ii)). Upon performing an iteration of RG, 
these contraction operations are inverted and hidden nodes are introduced. For example, 
in Figure 0d), the hidden nodes h\,h<i are introduced after performing RG on the closed 
neighborhood of node 4 on MST(V; D). These newly introduced hidden nodes in fact, turn 
out to be the inverse surrogate set of node 4, i.e., Sg _1 (5) = {5, hi, /12}. This is not merely 
a coincidence and we prove in Appendix IA. 51 that at each iteration, the set of hidden nodes 
introduced corresponds to the inverse surrogate set of the internal node. 



19 



Choi, Tan, Anandkumar, and Willsky 



We conclude this section by emphasizing that CLGrouping (i.e., CLRG or CLNJ) has 
two primary advantages. Firstly, as demonstrated in Theorem [10\ the structure of all 
minimal tree-structured graphical models can be recovered by CLGrouping in contrast to 
CLBlind. Secondly, it typically has much lower computational complexity compared to RG. 



5.5 Extension to General Discrete Models 

For general (i.e., not symmetric) discrete models, the mutual information I(Xi ; Xj) is in 
general not monotonic in the information distance dij, defined in ([9])0 As a result, LemmaH 
does not hold, i.e., the Chow-Liu tree Tql is not necessarily the same as MST(y;D). 
However, Lemma [8] does hold for all minimal latent tree models. Therefore, for general 
discrete models, we compute MST(T/; D) (instead of the Chow-Liu tree Tql with edge 
weights I(Xi ; Xj)), and apply RG or NJ to each internal node and its neighbors. This 
algorithm guarantees that the structure learned using CLGrouping is the same as T p if the 
distance matrix D is available. 



6. Sample-Based Algorithms for Learning Latent Tree Structures 

In Sections H] and [5j we designed algorithms for the exact reconstruction of latent trees as- 
suming that pv is a tree-decomposable distribution and the matrix of information distances 
D is available. In most (if not all) machine learning problems, the pairwise distributions 
p(xi, Xj) are unavailable. Consequently, D is also unavailable so RG, NJ and CLGrouping as 
stated in Sections H] and [5] are not directly applicable. In this section, we consider extending 
RG, NJ and CLGrouping to the case when only samples Xy are available. We show how to 
modify the previously proposed algorithms to accommodate ML estimated distances and 
we also provide sample complexity results for relaxed versions of RG and CLGrouping. 

ML Estimation of Information Distances 

The c anonical metho d for deterministic parameter estimation is via maximum-likelihood 
(ML) <|SerflineL [l98oh . We focus on Gaussian and symmetric discrete distributions in this 



section. The generalization to general discrete models is straightforward. For Gaussians 
graphical models, we use ML to estimate the entries of the covariance matrix0 i.e., 

1 n 

%i = -Y, x i x i > V ^' e ^ (22) 
n k=l 

The ML estimate of the correlation coefficient is defined as pij := Ejj/^jjX^) 1 / 2 . The 
estimated information distance is then given by the analogue of flS}, i.e., dij = — log | pjj \ . 
For symmetric discrete distributions, we estimate the crossover probability via ML aa 18 l 

% = - E l i x t ] * x f I' V *> 3 G V. (23) 



k=l 



16. The mutual information, however, is monotonic in dij for asymmetric binary discrete models. 

17. Recall that we assume that the mean of the true random vector X is known and equals to the zero vector 
so we do not need to subtract the empirical mean in (I22|) , 

18. We use !{•} to denote the indicator function. 
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The estimated information distance is given by the analogue of (flOj) . i.e., dij = —{K — 
1) log(l — K9ij). For both classes of models, it can easily be verified from the Central Limit 
Theorem and continuity arguments ( Serfling . 1980l ) that dij — dij = O p (n -1 / 2 ), where n 



is the number of samples. This means that the estimates of the information distances are 
consistent with rate of convergence being n" 1 / 2 . The mxm matrix of estimated information 
distances is denoted as D = [dij]. 

6.1 Relaxed Recursive Grouping (RG) Given Samples 

We now show how to relax the canonical RG algorithm described in Section 0] to handle the 
case when only D is available. Recall that RG calls the TestNodeRelationships procedure 
recursively to ascertain child-parent and sibling relationships via equality tests &ij k = di k — 
djk (cf. Section I3.2|) . These equality constraints are, in general, not satisfied with the 
estimated differences &ijk '■= dik — djk, which are computed based on the estimated distance 
in D. Besides, not all estimated distances are equally accurate. Longer distance estimates 
(i.e., lower correlation estimates) are less accurate for a given number of samples 1^1 As 
such, not all estimated distances can be used for testing inter-node relationships reliably. 
These observations motivate the following three modifications to the RG algorithm: 

1. Consider using a smaller subset of nodes to test whether &ij k is constant (across k). 

2. Apply a threshold (inequality) test to the &ijk values. 

3. Improve on the robustness of the estimated distances dih in (I13p and (I14p by averaging. 



We now describe each of these modifications in greater detail. Firstly, in the relaxed RG 
algorithm, we only compute <&ijk for those estimated distances d^, di k and dj k that are 
below a prescribed threshold r > since longer distance estimates are unreliable. As such, 
for each pair of nodes such that dij < r, associate the set 

fCij := he G V\{i,j} : max{d ik ,d jk } < r| . (24) 

This is the subset of nodes in V whose estimated distances to % and j are less than r. 
Compute <3?jjfc for all k 6 K-ij only. 

Secondly, instead of using equality tests in TestNodeRelationships to determine the rela- 
tionship between nodes i and j, we relax this test and consider the statistic 

Aij := max % ijk - min $ ijk (25) 

Intuitively, if Ay in (|25[) is close to zero, then nodes i and j are likely to be in the same 
family. Thus, declare that nodes i,j & V are in the same sibling group if 

Aij < e, (26) 



19. In fact, by using a large deviation result in IShenl (|2007l. Theorem 1), we can formally show that a larger 
number of samples is required to get a good approximation of pn- if it is small compared to when is 
large. 
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for another threshold e > 0. Similarly, an observed node k is identified as a parent node if 
\dik + dkj — dij\ < e for all i and j in the sibling group. 

Thirdly, in order to further improve on the quality of the distance estimate d^ of a 
newly introduced hidden node to observed nodes, we compute d^ using (fT3j) with different 
pairs of j G C(h) and k £ /Cy, and take the average as follows: 



(27) 




ih= 2(\C(h)\ 



Similarly, for any other node k £" C(h), we compute dkh using all child nodes in C{h) and 
C(k) (if C{k) 0) as follows: 

Y,ieC(h)(dik -d ih ), HkeV, 



d kh ={ m)\^C(h)K^ u^j, ^ ^ ^ „^y, (2g) 
I \m\\m\ ^(i,j)ec(h)xc(k)( d ij - d ih - d jk ), otherwise. 

It is easy to verify that if d^ and dkh are equal to d^ and dkh respectively, then (|27p 
and ([28]) reduce to (pT3|) and (fT^) respectively. 

The following theorem shows that relaxed RG is consistent, and with appropriately 
chosen thresholds e and t, it has the sample complexity logarithmic in the number of 
observed variables. The proof follows from standard Chernoff bounds and is provided in 
Appendix IA.6I 

Theorem 11 (Consistency and Sample Complexity of Relaxed RG) (i) Relaxed RG 
is structurally consistent for all T p £ 7>3- In addition, it is risk consistent for Gaussian and 
symmetric discrete distributions, (ii) Assume that the effective depth is 5(T p ;V) = O(l) 
(i.e., constant in m) and relaxed RG is used to reconstruct the tree given D. For every 
r] > 0, there exists thresholds e, r > such that if 

n>C log(m/^77) (29) 

for some constant C > 0, the error probability for structure reconstruction in ([5]) is bounded 
above by rj. If, in addition, p is a Gaussian or symmetric discrete distribution and n > 
C \og{m / tyrf) , the error probability for distribution reconstruction in © is also bounded 
above by ij. Thus, the sample complexity of relaxed RG, which is the number of samples 
required to achieve a desired level of accuracy, is logarithmic in m, the number of observed 
variables. 



As we observe from (|29p . the sample complexity for RG is logarithmic in m for shallow 
trees (i.e., trees where the effective depth is constant). This is in contrast to NJ where 
the sample complexity is super-polynom i al in the number of observed nodes for the HMM 
(|St. John et all 1200.4 lLacev and Chanel . l200fih . 



RG WITH /c-MEANS CLUSTERING 

In practice, if the number of samples is limited, the distance estimates dij are noisy and it 
is difficult to select the threshold e in Theorem II II to identify sibling nodes reliably. In our 
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experiments, we employ a modified version of the /c-means clustering algorithm to cluster 
a set of nodes with small Ay, defined in (|25p . as a family. Recall that we test each Ay 
locally with a fixed threshold e in (1261) . In contrast, the A;- means algorithm provides a global 
scheme and circum vents the need to select the threshold e. We adopt the silhouette method 
( Rousseeuw . 19871 ) with dissimilarity measure Ay to select optimal the number of clusters k. 



6.2 Relaxed Neighbor-Joining Given Samples 



In this section, we describe how NJ can be relaxed when the true distances are unavailable. 
We relax the NJ algorithm by using ML estimates of the distances dij in place of unavailable 
distances dy. NJ typically assume that all observed nodes are at the leaves of the latent 
tree, so after learning the latent tree, we perform the following post-processing step: If there 
exists an edge (i, h) £ W x H with d^ < e' (for a given threshold e' > 0), then (i,h) is 
contracted to a single no de whose label is i. The sample complexity of NJ is known to be 
0(exp(diam(T p )) log m) (|St. John et all I2003T ) and thus does not scale well when the latent 
tree T p has a large diameter. Comparisons between the sample comple x ities of other closel y 
related latent tree learning algorithms a re discussed in lAttesonl (jl999h : lErdos et al.1 (Il999h : 
Csurosl (boonh and ISt. John et al.l l|200al ). 



6.3 Relaxed CLGrouping Given Samples 

In this section, we discuss how to modify CLGrouping (CLRG and CLNG) when we only 
have access to the estimated information distance D. The relaxed version of CLGrouping 
differs from CLGrouping in two main aspects. Firstly, we replace the edge weights in the 
construction of the MST in (|17|) with the estimated information distances dij, i.e., 

f CL = MST(y;D) := argmin V (30) 

The procedure in (|30p can be shown to be equivalent to the learning of the ML tree structure 
given samples Xy if py is a Gaussian or symmetric discrete distributional It has also 
been shown that the error probability of structure learning P r fTcT tU^ol) converges to 
zero exponentially fast in the number of samples n ( Tan et all 2009 . 2010l ). Secondly, for 



CLRG (respectively CLNJ), we replace RG (respectively NJ) with the relaxed version of 
RG (respectively NJ). The sample complexity result of CLRG (and its proof) is similar to 
Theorem 1 1 1 1 and the proof is provided in Appendix IA.7I 



Theorem 12 (Consistency and Sample Complexity of Relaxed CLRG) (i) Re- 
laxed CLRG is structurally consistent for all T p £ 7>3- In addition, it is risk consistent 
for Gaussian and symmetric discrete distributions, (ii) Assume that the effective depth is 
5(T p ;V) = 0(1) (i.e., constant in m). Then the sample complexity of relaxed CLRG is 
logarithmic in m. 

After CLRG has been completed, as a final post-processing step, if we find that there 
exists an estimated distance dih on an edge with i £ W and h € H in the learned model 

20. This follows from the observation that the ML search for the optimal structure is equivalent to the 
KL-divergence minimization problem in (|15[) with pv replaced by pv, the empirical distribution of Xy. 
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which is smaller than some e' > (which we specify in our experiments), then edge (i,h) 
is contracted to the single node i. This is similar to relaxed NJ and serves to contract all 
strong edges in the learned model. 

6.4 Regularized CLGrouping for Learning Latent Tree Approximations 

For many practical applications, it is of interest to learn a latent tree that approximates the 
given empirical distribution. In general, introducing more hidden variables enables better 
fitting to the empirical distribution, but it increases the model com plexity and may lead 
to overfitting. The Bayesian Information Criterion ( Schwarz . 19781 ) provides a trade-off 



between model fitting and model complexity, and is defined as follows: 

BIC(f) = logp(x£;f) - ^logn (31) 

where T is a latent tree structure and k(T) is the number of free parameters, which grows 
linearly with the number of hidden variables because T is a tree. Here, we describe regu- 
larized CLGrouping, in which we use the BIC in (|3ip to specify a stopping criterion on the 
number of hidden variables added. 

For each internal node and its neighbors in the Chow-Liu tree, we use relaxed NJ or RG 
to learn a latent subtree. Unlike in regular CLGrouping, before we integrate this subtree 
into our model, we compute its BIC score. Computing the BIC score requires estimating 
the maximum likelihood parameters for the models, so for general discrete distributions, 
we run the EM algorithm on the subtree to estimate the parameters 1^1 After we compute 
the BIC scores for all subtrees corresponding to all internal nodes in the Chow-Liu tree, we 
choose the subtree that results in the highest BIC score and incorporate that subtree into 
the current tree model. 

The BIC score can be computed efficiently on a tree model with a few hidden variables. 
Thus, for computational efficiency, each time a set of hidden nodes is added to the model, 
we generate samples of hidden nodes conditioned on the samples of observed nodes, and 
use these augmented samples to compute the BIC score approximately when we evaluate 
the next subtree to be integrated in the model. 

If none of the subtrees increases the BIC score (i.e., the current tree has the highest 
BIC score), the procedure stops and outputs the estimated latent tree. Alternatively, if we 
wish to learn a latent tree with a given number of hidden nodes, we can used the BIC-based 
procedure mentioned in the previous paragraph to learn subtrees until the desired number 
of hidden nodes is introduced. Depending on whether we use NJ or RG as the subroutine, 
we denote the specific regularized CLGrouping algorithm as regCLNJ or regCLRG. 

This approach of using an approximation of the BIC score has been commonly used to 



learn a graphical model with hidden variables (jElidan and Friedmanl . l2005l ; IZhang and Kocka 



20041 ). However, for these algorithms, the BIC score needs to be evaluated for a large subset 



of nodes, whereas in CLGrouping, the Chow-Liu tree among observed variables prunes out 
many subsets, so we need to evaluate BIC scores only for a small number of candidate 
subsets (the number of internal nodes in the Chow-Liu tree). 



21. Note that for Gaussian and symmetric discrete distributions, the model parameters can be recovered 
from information distances directly using (O or <jT0j . 
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(a) Double star 



(b) HMM 




(c) 5-complete 

Figure 6: Latent tree structures used in our simulations. 



7. Experimental Results 

In this section, we compare the performances of various latent tree learning algorithms. 
We first show simulation results on synthetic datasets with known latent tree structures 
to demonstrate the consistency of our algorithms. We also analyze the performance of 
these algorithms when we change the underlying latent tree structures. Then, we show that 
our algorithms can approximate arbitrary multivariate probability distributions with latent 
trees by applying them to two real-world datasets, a monthly stock returns example and 
the 20 newsgroups dataset. 



7.1 Simulations using Synthetic Datasets 

In order to analyze the performances of different tree reconstruction algorithms, we generate 
samples from known latent tree structures with varying sample sizes and apply recq n struc - 
tion algorithms. We compare the neighbor-joining method (NJ) (jSaitou and Neil . 1 19871 ) 
with recursive grouping (RG), Chow-Liu Neighbor Joining (CLNJ), and Chow-Liu Recur- 
sive Grouping (CLRG). Since the algorithms are given only samples of observed variables, 
we use the sample-based algorithms described in Section [6J For all our experiments, we use 
the same edge-contraction threshold e' = — log 0.9 (see Sections 16.21 and I6.3p . and set r in 
Section 16.11 to grow logarithmically with the number of samples. 

Figure [6] shows the three latent tree structures used in our simulations. The double- 
star has 2 hidden and 80 observed nodes, the HMM has 78 hidden and 80 observed nodes, 
and the 5-complete tree has 25 hidden and 81 observed nodes including the root node. For 
simplicity, we present simulation results only on Gaussian models but note that the behavior 
on discrete models is similar. All correlation coefficients on the edges pij were independently 
drawn from a uniform distribution supported on [0.2, 0.8]. The performance of each method 
is measured by averaging over 200 independent runs with different parameters. We use the 
following performance metrics to quantify the performance of each algorithm in Figure [7J 
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Struture Recovery Error Rate Robinson-Foulds Metric Error in Hidden Variables 



log 10 (KL-divergence) 




1k 2k 10k 20k 100k 200k 1k 2k 10k 20k 100k 200k 

Number of Samples Number of Samples 



10k 20k 100k 200k 

Number of Samples 



1k 2k 10k 20k 100k 200k 

Number of Samples 



Figure 7: Performance of RG, NJ, CLRG, and CLNJ for the latent trees shown in Figure El 



(i) Structure recovery error rate: This is the proportion of times that the proposed 
algorithm fails to recover the true latent tree structure. Note that this is a very strict 
measure since even a single wrong hidden node or misplaced edge results in an error 
for the entire structure. 



(ii) Robinson Foulds metric (jRobinson and Fouldd . fl98lh : This popular phylogenetic 



tree-distortion metric computes the number of graph transformations (edge contrac- 
tion or expansion) needed to be applied to the estimated graph in order to get the 
correct structure. This metric quantifies the difference in the structures of the esti- 
mated and true models. 

(hi) Error in the number of hidden variables: We compute the average number of 
hidden variables introduced by each method and plot the absolute difference between 
the average estimated hidden variables and the number of hidden variables in the true 
structure. 
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RG 


NJ 


CLRG 


CLNJ 


HMM 


10.16 


0.02 


0.10 


0.05 


5-complete 


7.91 


0.02 


0.26 


0.06 


Double star 


1.43 


0.01 


0.76 


0.20 



Table 1: Average running time of each algorithm in seconds. 



(iv) KL-divergence D{py \ \ Py): This is a measure of the distance between the estimated 
and the true models over the set of observed nodes Vr 2 \ 



We first note that from the structural error rate plots that the double star is the easiest 
structure to recover and the 5-complete tree is the hardest. In general, given the same 
number of observed variables, a latent tree with more hidden variables or larger effective 
depth (see Section [2]) is more difficult to recover. 

For the double star, RG clearly outperforms all other methods. With only 1,000 sam- 
ples, it recovers the true structure exactly in all 200 runs. On the other hand, CLGrouping 
performs significantly better than RG for the HMM. There are two reas ons for such per - 



2010) 



formance differences. Firstly, for Gaussian distributions, it was shown (jTan et al. 
that given the same number of variables and their samples, the Chow-Liu algorithm is most 
accurate for a chain and least accurate for a star. Since the Chow-Liu tree of a latent double 
star graph is close to a star, and the Chow-Liu tree of a latent HMM is close to a chain, the 
Chow-Liu tree tend to be more accurate for the HMM than for the double star. Secondly, 
the internal nodes in the Chow-Liu tree of the HMM tend to have small degrees, so we can 
apply RG or NJ to a very small neighborhood, which results in a significant improvement 
in both accuracy and computational complexity. 

Note that NJ is particularly poor at recovering the HMM structure. In fact, it has 
been shown that even if the number of samples grows polynomially with the number of 
observed variable s (i.e., n = Q(m B ) for any B > 0), it is insufficient for NJ to recover 
HMM structures ( Lacev and Chanel . 20061 ). The 5-complete tree has two layers of hidden 
nodes, making it very difficult to recover the exact structure using any method. CLNJ has 
the best structure recovery error rate and KL divergence, while CLRG has the smallest 
Robinson- Foulds metric. 

Table [T] shows the running time of each algorithm averaged over 200 runs and all sample 
sizes. All algorithms are implemented in MATLAB. As expected, we observe that CLRG is 
significantly faster than RG for HMM and 5-complete graphs. NJ is fastest, but CLNJ is 
also very efficient and leads to much more accurate reconstruction of latent trees. 

Based on the simulation results, we conclude that for a latent tree with a few hidden 
variables, RG is most accurate, and for a latent tree with a large diameter, CLNJ performs 
the best. A latent tree with multiple layers of hidden variables is more difficult to recover 
correctly using any method, and CLNJ and CLRG outperform NJ and RG. 



22. Note that this is not the same quantity as in ((6]l because if the number of hidden variables is estimated 
incorrectly, D(p\\p n ) is infinite so we plot D(pv \\Pv) instead. However, for Gaussian and symmetric 
discrete distributions, D(p converges to zero in probability since the number of hidden variables is 
estimated correctly asymptotically. 
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Log-Likelihood 


BIC 


# Hidden 


# Parameters 


Time (sees) 


CL 


-13,321 


-13,547 





84 


0.15 


NJ 


-12,400 


-12,747 


45 


129 


0.02 


RG 


-14,042 


-14,300 


12 


96 


21.15 


CLNJ 


-11,990 


-12,294 


29 


113 


0.24 


CLRG 


-12,879 


-13,174 


26 


110 


0.40 



Table 2: Comparison of the log- likelihood, BIC, number of hidden variables introduced, 
number of parameters, and running time for the monthly stock returns example. 

BIC score 

-12,000 
-12,500 
-13,000 
-13,500 
-14,000 
-14,500 

CL NJ RG CLNJ CLRG 

Figure 8: Plot of BIC scores for the monthly stock returns example. 

7.2 Monthly Stock Returns 

We apply our latent tree learning algorithms to model the dependency structure of monthly 
stock returns of 84 companies in the S&P 100 stock indexP^ We use the samples of the 
monthly returns from 1990 to 2007. As shown in Table [5] and Figure El CLNJ achieves 
the highest log-likelihood and BIC scores. NJ introduces more hidden variables than CLNJ 
and has lower log-likelihoods, which implies that starting from a Chow-Liu tree helps to 
get a better latent tree approximation. Figure [11] shows the latent tree structure learned 
using the CLNJ method. Each observed node is labeled with the ticker of the company. 
Note that related companies are closely located on the tree. Many hidden nodes can be 
interpreted as industries or divisions. For example, hi has Verizon, Sprint, and T-mobile 
as descendants, and can be interpreted as the telecom industry, and h3 correspond to the 
technology division with companies such as Microsoft, Apple, and IBM as descendants. 
Nodes h26 and h27 group commercial banks together, and h25 has all retail stores as child 
nodes. 

7.3 20 Newsgroups with 100 Words 

For our last experiment, we apply our latent tree learning algorithms to the 20 News- 
groups dataset with 100 words r 4 l The dataset consists of 16,242 binary samples of 100 

23. We disregard 16 companies that have been listed on S&P 100 only after 1990. 

24. http : //cs .nyu. edu/~roweis/data/20news_wl00 .mat 
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words, indicating whether each word appears in each posting or not. In addition to the 
Chow-Liu tree (CL), NJ, RG, CLNJ, and CLRG, we also compare the performances with 
the regCLNJ and regCLRG (described in Section I6.4p . the latent cluster model (LCM) 
(ILazarsfeld and Henrvl. I1968T). and B IN, which is a greedy algorithm for learning latent 
trees (jHarmeling and Williamsl . l20ld ). 

Table [3] shows the performance of different algorithms, and Figure [9] plots the BIC 
score. We use the MATLAB code (a small part of it is implemented in C) provided by 
Harmeling and Williamsl (|201oH ^I to run LCM and BIN. Note that although LCM has only 



one hidden node, the hidden node has 16 state s, resulting in many parameters. We also 
tried to run the algorithm by Chen et al. ( 20081 ). but their JAVA implementation on this 



dataset did not complete even after several days. For NJ, RG, CLNJ, and CLRG, we 
learned the structures using only information distances (defined in Q) and then used the 
EM algorithm to fit the parameters. For regCLNJ and regCLRG, the model parameters are 
learned during the structure learning procedure by running the EM algorithm locally, and 
once the structure learning is over, we refine the parameters by running the EM algorithm 
for the entire latent tree. All methods are implemented in MATLAB except the E-step of 
the EM algorithm, which is implemented in C++. 

Despite having many parameters, the models learned via LCM have the best BIC score. 
However, it does not reveal any interesting structure and is computationally more expensive 
to learn. In addition, it may result in overfitting. In order to show this, we split the dataset 
randomly and use half as the training set and the other half as the test set. Table H] shows 
the performance of applying the latent trees learned from the training set to the test set, 
and Figure [10] shows the log-likelihood on the training and the test sets. For LCM, the test 
log-likelihood drops significantly compared to the training log-likelihood, indicating that 
LCM is overfitting the training data. NJ, CLNJ, and CLRG achieve high log-likelihood 
scores on the test set. Although regCLNJ and regCLRG do not result in a better BIC 
score, they introduce fewer hidden variables, which is desirable if we wish to learn a latent 
tree with small computational complexity, or if we wish to discover a few hidden variables 
that are meaningful in explaining the dependencies of observed variables. 

Figure [T2l shows the latent tree structure learned using regCLRG from the entire dataset. 
Many hidden variables in the tree can be interpreted as topics - h5 as sports, h9 as computer 
technology, hl3 as medical, etc. Note that some words have multiple meanings and appear 
in different topics - e.g., program can be used in the phrase "space program" as well as 
"computer program", and win may indicate the windows operating system or winning in 
sports games. 



8. Conclusion 

In this paper, we proposed algorithms to learn a latent tree model from the information 
distances of observed variables. Our first algorithm, recursive grouping, identifies sibling 
and parent-child relationships and introduces hidden nodes recursively. Our second al- 
gorithm, CLGrouping, first learns the Chow-Liu tree among observed variables and then 
applies latent-tree-learning subroutines such as recursive grouping or neighbor joining lo- 
cally to each internal node in the Chow-Liu tree and its neighbors. These algorithms are 

25. http : //people .kyb. tuebingen.mpg .de/harmeling/code/ltt-1 . 3 .tar 
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Log-Likelihood 


BIC 


Hidden 


Params 


Time (s) 


Total 


Structure 


EM 


CL 


-238,713 


-239,677 





199 


8.9 






LCM 


-223,096 


-230,925 


1 


1,615 


8,835.9 






BIN 


-232,042 


-233,952 


98 


394 


3022.6 






NJ 


-230,575 


-232,257 


74 


347 


1611.2 


3.3 


1608.2 


RG 


-239,619 


-240,875 


30 


259 


927.1 


30.8 


896.4 


CLNJ 


-230,858 


-232,540 


74 


347 


1479.6 


2.7 


1476.8 


CLRG 


-231,279 


-232,738 


51 


301 


1224.6 


3.1 


1224.6 


regCLNJ 


-235,326 


-236,553 


27 


253 


630.8 


449.7 


181.1 


regCLRG 


-234,012 


-235,229 


26 


251 


606.9 


493.0 


113.9 



Table 3: Comparison between various algorithms on the newsgroup set. 



-232,000 
-234,000 
-236,000 
-238,000 
-240,000 
-242,000 

CL LCM BIN NJ RG CLNJ CLRG regCLNJ regCLRG 

Figure 9: The BIC scores of various algorithms on the newsgroup set. 




structurally consistent (and risk consistent as well in the case of Gaussian and discrete sym- 
metric distributions), and have sample complexity logarithmic in the number of observed 
variables. 

Using simulations with synthetic datasets, we showed that RG performs well when the 
number of hidden variables is small, and that CLGrouping performs significantly better 
than other algorithms when there are many hidden variables in the latent tree. Using both 
Gaussian and discrete real-world data, we compared the performances of our algorithms 
to other EM-based approaches and the neighbor-joining method, and our algorithm show 
superior results in both accuracy (measured by KL-divergence and graph distance) and 
computational efficiency. In addition, we introduced regularized CLGrouping, which is 
useful in learning a latent tree approximation with a given number of hidden nodes. The 
MATLAB implementation of our algorithms can be downloaded from the project webpage 
http : //people . csail .mit . edu/myungj in/latentTree .html, 
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Time (s) 
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-121,003 
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3.0 






LCM 


-112,746 


-117,288 


-116,884 


-120,949 


1 


1,009 


3,197.7 






BIN 


-117,172 


-118,675 


-117,957 


-119,460 


78 


334 


1,331.3 






NJ 


-115,319 


-116,908 


-116,011 


-117,600 


77 


353 


802.8 


1.3 


801.5 


RG 


-118,280 


-119,248 


-119,181 


-120,149 


8 


215 


137.6 


7.6 


130.0 


CLNJ 


-115,372 


-116,987 


-116,036 


-117,652 


80 


359 


648.0 


1.5 


646.5 


CLRG 


-115,565 


-116,920 


-116,199 


-117,554 


51 


301 


506.0 


1.7 


504.3 


regCLNJ 


-117,723 


-118,924 


-118,606 


-119,808 


34 


267 


425.5 


251.3 


174.2 


regCLRG 


-116,980 


-118,119 


-117,652 


-118,791 


27 


253 


285.7 


236.5 


49.2 



Table 4: Comparison between various algorithms on the newsgroup dataset with a 
train/test split. 



Log-likelihood 




CL LCM BIN NI RG CLNJ CLRG regCLNJ reeCLRG 



Figure 10: Train and test log-likelihood scores of various algorithms on the newsgroup 
dataset with a train/test split. 



Appendix A. Proofs 

A.l Proof of Lemma [4} Sibling Grouping 

We prove statement (i) in Lemma H] using (|12p in Proposition [3l Statement (ii) follows 
along similar lines and its proof is omitted for brevity. 

// : From the additive property of information distances in (|12p . if i is a leaf node and 
j is its parent, d^ = d^ + djk and thus &ijk = dij for all k / i, j. 

Only If: Now assume that 3>j,& = dij for all k £ V \ {i,j}- In order to prove that i 
is a leaf node and j is its parent, assume to the contrary, that i and j are not connected 
with an edge, then there exists a node u / i,j on the path connecting i and j. If u G V, 
then let k = u. Otherwise, let h be an observed node in the subtree away from i and j 
(see Figure IT3lfa)). which exists since T p € 7>3- By the additive property of information 
distances in (]12p and the assumption that all distances are positive, 

dij — di u + d u j > di u d u j — dik d^j — ^ijk (*^J 
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(a) 



(b) 



(d) 



Figure 13: Shaded nodes indicate observed nodes and the rest indicate hidden nodes. (a),(b) 
Figures for Proof of Lemma [H Dashed red line represent the subtrees away from 
i and j. (c) Figure for Proof of Lemma E^i). (d) Figure for Proof of Lemma [8]^ il) 



which is a contradiction. If i is not a leaf node in T p , then there exist a node u 7^ i,j such 
that (i,u) G E p . Let k = u if u G V, otherwise, let k be an observed node in the subtree 
away from i and j (see Figure [TBTb)). Then, 



A. 2 Proof of Theorem [5} Correctness and Computational Complexity of RG 

The correctness of RG follows from the following observations: Firstly, at each iteration 
of RG, the sibling groups are identified correctly by Lemma [H Since the new parent node 
added to a partition which does not contain an observed parent corresponds to a hidden 
node (in the original latent tree), a subforest of T p is recovered at each iteration. Secondly, 
from Proposition [3l for all i,j in the active set Y, the information distances can be 
computed exactly with (|13f) and (|14p . Thirdly, since the information distances between 
nodes in the updated active set Y are known, they can now be regarded as observed nodes. 
Subforests of T p are constructed at each iteration and when \Y\ < 2, the entire latent tree 
is recovered. 

The computational complexity follows from the fact there are a maximum of 0(m 3 ) 
differences §>ijk = dik — djk that we have to compute at each iteration of RG. Furthermore, 
there are at most diam(T p ) subsets in the coarsest partition (cf. step [3]) of Y at the first 
iteration, and the number of subsets reduce at least by 2 from one iteration to the next due 
to the assumption that T p G 7>3- This proves the claim that the computational complexity 
is upper bounded by 0(diam(T p )m 3 ). □ 

A.3 Proof of Lemma [8} Properties of the MST 

(i) For an edge € E p such that Sg(i) / Sg(j'), let V{\j C V and Vju C V denote 

observed nodes in the subtrees obtained by the removal of edge where the former 

includes node i and excludes node j and vice versa (see Figure [T3lfc) ) . Using part (ii) of the 
lemma and the fat that Sg(i) / Sg(j), it can be shown that Sg(i) G V^j and Sg(j) G Vj\%- 
Since (i, j) lies on the unique path from k to I on T p , for all observed nodes k G Vi\j,l G Vju, 



®ijk — da- — d^ 




(33) 



which is again a contradiction. Therefore, G E p and i is a leaf node. 



□ 
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we have 

dki = dki + dij + dji > dsg(i),t + dij + ^Sg(j)j = dsg(t),Sgtf)> (34) 

where the inequality is from the definition of surrogacy and the final equality uses the fact 
that Sg(i) 7^ Sg(j). By using the property of the MST that (Sg(i), Sg(j')) is the shortest 
edge from V^j to Vj\t, we have (|20|) . 

(ii) First assume that we have a tie-breaking rule consistent across all hidden nodes so 
that if d u h = d v h = min^y dih and d u h> = d v y = mhijgy <kh' then both h and h' choose the 
same surrogate node. Let j G V , /i G Sg _1 (j), and let m be a node on the path connecting 
/i and j (see Figure PT37 d)). Assume that Sg(u) = k 7^ j. If d U j > d u k, then 

<4j = dhu + dnj > <4« + d u k = dhk, (35) 

which is a contradiction since j = Sg(h). If <i u j = d u k, then d^- = d^k, which is again a 
contradiction to the consistent tie-breaking rule. Thus, the surrogate node of u is j. 

(iii) First we claim that 

ISg- 1 ^! < A(T p )f s ^ v \ (36) 

To prove this claim, let 7 be the longest (worst-case) graph distance of any hidden node 
h G H from its surrogate, i.e., 

7 := max|Path(/i,Sg(/i);r p )|. (37) 

heH 

From the degree bound, for each i G V, there are at most A(T P ) 7 hidden nodes that are 
within the graph distance of 7@ so 

\Sg~\i)\ < A(T p y (38) 

for all i G V. Let d* := max^/f d/i,s g (fc,) be the longest (worst-case) information distance 
between a hidden node and its surrogate. From the bounds on the information distances, 
£7 < d*. In addition, for each h G H, let z(h) := argmin je y |Path((/i, j); T p )\ be the 
observed node that is closest to h in graph distance. Then, by definition of the effective 
depth, dh£g(h) < d^ z (h\ < u5 for all h G H, and we have d* < uS. Since Z7 < d* < uS, we 
also have 

7 < uS/l. (39) 



Combining this result with ()38f) establishes the claim in f)36f) . Now consider 

A(MST(y;D)) < A(T P ) max |Sg— < A(T p ) 1+ t (40) 

where (a) is a result of the application of (|20f) and (6) results from (|36|) . This completes 
the proof of the claim in (|2ip in Lemma [HJ □ 



26. The maximum size of the inverse surrogate set in H37H is attained by a A(T p )-ary complete tree. 
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Figure 14: Figure for Proof of Theorem [Toj (a) Original latent tree, (b) Illustration of 
CLGrouping. (c) Illustration of the trees constructed using edge contractions. 



A. 4 Proof of Theorem [9} Correctness and Computational Complexity of 
CLBlind 

It suffices to show that the Chow-Liu tree MST(V; d) is a transformation of the true latent 
tree T p (with parameters such that p € 7'(7biind)) as follows: contract the edge connecting 
each hidden variable h with its surrogate node Sg(h) (one of its children and a leaf by 
assumption) . Note that the blind transformation on the MST is merely the inverse mapping 
of the above. From (I20p . all the children of a hidden node h, except its surrogate Sg(/i), 
are neighbors of its surrogate node Sg(/i) in MST(V;d). Moreover, these children of h 
which are not surrogates of any hidden nodes are leaf nodes in the MST. Similarly for 
two hidden nodes h\,h,2 E H such that {h\,h,2) £ E p , (Sg(h±), Sg(/i2)) £ MST(F;d) from 
Lemma E^i). Hence, CLBlind outputs the correct tree structure T p . The computational 
complexity follows from the fact that the blind transformation is linear in the number of 
internal nodes, which is less than the number of observed nodes, and that learning the 
Chow-Liu tree takes 0(m 2 log m) operations. □ 

A. 5 Proof of Theorem llOt Correctness and Computational Complexity of 
CLRG 

We first define some new notations. 

Notation: Let I := V \ Leaf (MST (V; d)) be the set of internal nodes. Let v r E I 
be the internal node visited at iteration r, and let H r be all hidden nodes in the inverse 
surrogate set Sg _1 (v r ), i.e., H r = Sg -1 (i/ r ) \ {v r }. Let A r := nbd[t> r ; T r_1 ], and hence A r is 
the node set input to the recursive grouping routine at iteration r, and let RG(^4 r , d) be the 
output latent tree learned by recursive grouping. Define T r as the tree output at the end 
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of r iterations of CLGrouping. Let V r := {v r+1 ,v r+2 , . . . , v^} be the set of internal nodes 
that have not yet been visited by CLGrouping at the end of r iterations. Let EC(T P , V r ) be 
the tree constructed using edge contractions as follows: in the latent tree T p , we contract 
edges corresponding to each node u £ V r and all hidden nodes in its inverse surrogate set 
Sg -1 (it). Let S r be a subtree of EC(T p , V r ) spanning v r , H r and their neighbors. 

For example, in Figure the original latent tree T p is shown in Figure [T47a). and 
T°, T , T 2 are shown in Figure fT47 b). The set of internal nodes is X = {3,5}. In the 
first iteration, v 1 = 5, A 1 = {1,3,4,5} and H 1 = {hi,h2\. In the second iteration, v 2 = 
3, A 2 = {2,3,6,/n} and H 1 = {h 3 }. V° = {3,5}, V 1 = {3}, and V 2 = 0, and in 
Figure HKc), we show EC{T P ,V°), EC{T p ,V l ), and EC(T P ,V 2 ). In EC^T/ 1 ), S l is the 
subtree spanning 5,hi,hi and their neighbors, i.e., {1, 3, 4, 5, hi, fo}- In EC(T P ,V 2 ), S 2 
is the subtree spanning 3,/i3 and their neighbors, i.e., {2, 3, 6, hi, h^}. Note that T° = 
EC(T P , V°), T 1 = EC(T P , V 1 ), and T 2 = EC(T P , V 2 ); we show below that this holds for all 
CLGrouping iterations in general. 

We prove the theorem by induction on the iterations r = 1, . . . , \I\ of the CLGrouping 
algorithm. 

Induction Hypothesis: At the end of k iterations of CLGrouping, the tree obtained is 



In words, the latent tree after k iterations of CLGrouping can be constructed by contracting 
each surrogate node in T p that has not been visited by CLGrouping with its inverse surrogate 
set. Note that = and EC(T p ,V^) is equivalent to the original latent tree T p . Thus, 



Base Step r = 0: The claim in (I4ip holds since V° = X and the input to the CLGrouping 
procedure is the Chow-Liu tree MST(V;D), which is obtained by contracting all surrogate 
nodes and their inverse surrogate sets (see Section 15. 2\ . 

Induction Step: Assume (|4ip is true for k = 1, . . . , r — 1. Now consider k = r. 
We first compare the two latent trees EC(T p ,V r ) and EC(T p ,V r ~ l ). By the definition 
of EC, if we contract edges with v r and the hidden nodes in its inverse surrogate set H r 
on the tree EC(T„,V r ), then we obtain EC(T p ,V r ~ 1 ), which is equivalent to T r_1 by the 
induction assumption. Note that as shown in Figure [T4l this transformation is local to the 
subtree S r : contracting v r with H r on EC(T p , V r ) transforms S r into a star graph with v r 
at its center and the hidden nodes H r removed (contracted with v r ). 

Recall that the CLGrouping procedure replaces the induced subtree of A r in T r ~ x (which 
is precisely the star graph mentioned above by the induction hypothesis) with RG(A r ,d) 
to obtain T r . Thus, to prove that T r = EC(T p , V r ), we only need to show that RG reverses 
the edge-contraction operations on v r and H r , that is, the subtree S r = RG(^4 r ,d). We 
first show that S r £ 7>3, i.e., it is identifiable (minimal) when A r is the set of visible nodes. 
This is because an edge contraction operation does not decrease the degree of any existing 
nodes. Since T p £ 7>3, all hidden nodes in EC(T p , V r ) have degrees equal to or greater 
than 3, and since we are including all neighbors of H r in the subtree S r , we have S r £ 7>3- 
By Theorem [5l RG reconstructs all latent trees in 7>3 and hence, S r = RG(A r ,d). 



T k = EC(T p ,V k ) 



Vk = 0,1,... . 



(41) 




is the original 
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The computational complexity follows from the corresponding result in recursive group- 
ing. The Chow-Liu tree can be constructed with 0(m 2 log m) complexity. The recursive 
grouping procedure has complexity max r |A r | 3 and max r \A r \ < A(MST(V; gQ). □ 

A. 6 Proof of Theorem lilt Consistency and Sample Complexity of Relaxed RG 

(i) Structural consistency follows from Theorem [5] and the fact that the ML estimates of 
information distances dij approach dij (in probability) for all i,j G V as the number of 
samples tends to infinity. 

Risk consistency for Gaussian and symmetric discrete distributions follows from struc- 
tural consistency. If the structure is correctly recovered, we can use the equations in (|13p 
and (|14p to infer the information distances. Since the distances are in one-to-one corre- 
spondence to the correlation coefficients and the crossover probability for Gaussian and 
symmetric discrete distribution respectively, the parameters are also consistent. This im- 
plies that the KL-divergence between p and p" 1 tends to zero (in probability) as the number 
of samples n tends to infinity. This completes the proof. 

(ii) The theorem follows by using the assumption that the effective depth 5 = 5(T p ; V) 
is constant. Recall that r > is the threshold used in relaxed RG (see (|24|) in Section TG.ip , 
Let the set of triples (i,j,k) whose pairwise information distances are less than r apart be 
J, i.e., (i,j,k) E J if and only if maxjdjj, djk, d^i} < r. Since we assume that the true 
information distances are uniformly bounded, there exist r > and some sufficiently small 
A > so that if \<&ijk — &ijk\ < A for all k) G J ~, then RG recovers the correct latent 
structure. 

Define the error event := {I'&jjfc — &ijk\ > We note that the probability of the 
event Eijk decays exponentially fast, i.e., there exists > such that for all n G N, 

Pr(%fc) < exp(-nJ ijfe ). (42) 



The proof of (142(1 follows readily for Chernoff bounds (jHoeffdine) . Il958l ) and is omitted. The 



error probability associated to structure learning can be bounded as follows: 

Pr (h(f n ) ^ T p ) < Pr |J £ ijk \ < ^ Pr(^ fc ) (43) 

\(i,j,k)ej J (i,j,k)ej 



(c) 

< m max Vx{£ijk) < exp(3 log m) exp 



-n min Jj,-fc 

(i,j,k)ej 



(44) 



where (a) follows from the fact that if the event {h(T n ) ^ T p } occurs, then there is at least 
one sibling or parent-child relationship that is incorrect, which corresponds to the union of 
the events Sijk, i-e., there exists a triple (i,j,k) € J is such that ^ijt differs from by 
more than A. Inequality (b) follows from the union bound and (c) follows from (|42p . 

Because the information distances are uniformly bounded, there also exists a constant 
<Anin > (independent of m) such that miauj ja^j Jijk > </ m in for all m G N. Hence 
for every rj > 0, if the number of samples satisfies n > 3(log(m/^/r/))/J m i n , the error 
probability is bounded above by rj. Let C := 3/J m ; n to complete the proof of the sample 
complexity result in (j29|) . The proof for the logarithmic sample complexity of distribution 
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reconstruction for Gaussian and symmetric discrete models follows from the logarithmic 
sample complexity result for structure learning and the fact that the information distances 
are in a one-to-one correspondence with the correlation coefficients (for Gaussian models) 
or crossover probabilities (for symmetric discrete models). 

A. 7 Proof of Theorem I12t Consistency and Sample Complexity of Relaxed 
CLRG 

(i) Structural consistency of CLGrouping follows from structural consistency of RG (or 
NJ) and the consistency of the Chow-Liu algorithm. Risk consistency of CLGrouping for 
Gaussian or symmetric distributions follows from the structural consistency, and the proof 
is similar to the proof of Theorem [TTTi) . 

(ii) The input t o the CLGroupin g procedure Tql is the Chow-Liu tree and has 0(log m) 
sample complexity Tan et al. ( 2010l ). where m is the size of the tree. From Theorem II 1| 



the recursive grouping procedure has O(logm) sample complexity (for appropriately chosen 
thresholds) when the input information distances are uniformly bounded. In any iteration 
of the CLGrouping, the information distances satisfy dij < ju, where 7, defined in (|37p . is 
the worst-case graph distance of any hidden node from its surrogate. Since 7 satisfies (1391) . 
dij < u 2 5/l. If the effective depth 5 = O(l) (as assumed), the distances dij = 0(1) and the 
sample complexity is O(logm). □ 
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